Dynamic Data Citation Service-Subset Tool for Operational Data Management
In earth observation and climatological sciences, data and their data services grow daily
and over large spatial extents, driven by the high coverage rate of satellite sensors, by model
calculations, and by continuous meteorological in situ observations. To reuse such data, especially data
fragments and their data services, in a collaborative and reproducible manner by citing the original
source, data analysts, e.g., researchers or impact modelers, need a way to identify the exact
version, precise time information, parameter, and names of the dataset used. Doing this manually would
make the citation of data fragments as a subset of an entire dataset rather complex and imprecise.
Data in climate research are in most cases multidimensional, structured grid data that can
change partially over time. The citation of such evolving content requires the approach of "dynamic
data citation". The applied approach is based on associating queries with persistent identifiers. These
queries contain the subsetting parameters, e.g., the spatial coordinates of the desired study area or the
time frame with a start and end date, which are automatically included in the metadata of the newly
generated subset and thus document the data history, i.e., the data provenance,
which has to be established in data repository ecosystems. The Research Data Alliance Data Citation
Working Group (RDA Data Citation WG) summarized the scientific status quo as well as the state of
the art of existing citation and data management concepts and developed a scalable dynamic
data citation methodology for evolving data. The Data Centre at the Climate Change Centre Austria
(CCCA) has implemented these recommendations and has offered an operational service for dynamic
data citation of climate scenario data since 2017. Aware that the topic depends heavily on
bibliographic citation research, which is still under discussion,
the CCCA service on Dynamic Data Citation focused on issues specific to the climate domain, such as
characteristics of data, formats, software environment, and usage behavior. Beyond sharing the
experience gained so far, the current effort is directed at the scalability of the implementation,
e.g., towards the potential of an Open Data Cube solution.
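The query-based approach described in the abstract above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory query store and a placeholder identifier scheme; the function names, the parameter keys, and the dataset name are illustrative and are not taken from the CCCA service. The sketch normalizes the subsetting parameters, hashes them together with the dataset version, and reuses an existing identifier when the same query is registered again.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

# Hypothetical in-memory query store; a real service would use a database.
QUERY_STORE = {}


def normalize_query(params):
    """Serialize query parameters in a canonical (sorted-key) form."""
    return json.dumps(params, sort_keys=True, separators=(",", ":"))


def register_query(dataset_id, dataset_version, params):
    """Store a subset query and return the identifier associated with it.

    An identical query against the same dataset version reuses the
    existing identifier instead of creating a new one.
    """
    normalized = normalize_query(
        {"dataset": dataset_id, "version": dataset_version, "subset": params}
    )
    query_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    for pid, record in QUERY_STORE.items():
        if record["query_hash"] == query_hash:
            return pid
    # Placeholder identifier, not a real persistent identifier scheme.
    pid = f"hypothetical-pid:{uuid.uuid4()}"
    QUERY_STORE[pid] = {
        "query": normalized,
        "query_hash": query_hash,
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }
    return pid


# Illustrative subsetting parameters: study area and time frame.
subset = {
    "bbox": [9.5, 46.4, 17.2, 49.0],
    "time_start": "2021-01-01",
    "time_end": "2050-12-31",
    "parameter": "tas",
}
pid_a = register_query("example-scenario", "v1", subset)
# Same parameters in a different key order map to the same identifier.
pid_b = register_query("example-scenario", "v1", dict(reversed(list(subset.items()))))
print(pid_a == pid_b)
```

Storing the normalized query with its execution timestamp is what allows a subset of evolving data to be re-derived later; a production service would also need versioned data, which this sketch does not model.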
Application of machine learning to weather-triggered hazards and damages in alpine territory
The application of tree-based machine learning methods in the field of
hazard event prediction is challenging. The imbalanced binary classification
task requires suitable countermeasures in order to improve the model's
ability to predict the minority class (hazard event). We estimate the potential of different techniques [Branco, Torgo, and Ribeiro, 2015], such as
pre-processing, special purpose learning methods and post-processing, by
applying them to random forests and gradient boosted trees under
well-known synthetic conditions. Based on k-nearest neighbor difficulty analyses and various performance metrics evaluated for ten synthetic data sets, the special purpose learning methods outperform the pre- and post-processing approaches. Following up on Enigl et al., 2019, we analyse the yearly sum of hazard events for seven categories and expand the Austrian
hazard event space by "time independent" features derived from terrain,
soil, vegetation and geological data. The data is further filtered for slide
events on which we perform a difficulty analysis for different geological
domains. The analysis reveals a high degree of imbalance and difficulty,
at a level where comparable synthetic data sets are at the limit of what can be validly evaluated. Nevertheless, an extensive modeling approach for the hazard category slide is performed using scaled random forests (SXRF) and scaled gradient boosted trees (SXGBT), both implemented in the XGBoost framework [T. Chen and Guestrin, 2016]. Here, the term "scaled" means that the weights of minority and majority instances in the cost function are balanced according to their global imbalance ratio. SXRF is outperformed by SXGBT
and 5-fold cross validation scores indicate the robustness of the model
with consistent sensitivity scores of approximately 0.75 and areas under the
receiver operating characteristic curve of approximately 0.80. The most important numeric
features are "slope" (considering correlated features), "minimum distance to
street", "minimum distance to nappe boundary" and the "topsoil physical
property bulk density". The most important binary one-hot encoded categorical features are the landcover class "Woody" and the geological domains "Austroalpine Units" and "Siliciclastic Rocks". Non-linear tree-based machine learning methods may further improve data-driven models for susceptibility mapping of spatially non-persistent hazard events. Nevertheless, they depend heavily on the quality of the underlying feature and hazard event data.
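The "scaled" weighting can be sketched as follows, assuming the weight of each minority instance is set to the global ratio of majority to minority instances. XGBoost exposes a `scale_pos_weight` parameter for this kind of class weighting; whether the authors configured it exactly this way is an assumption. The function names below are hypothetical, and the example uses only the Python standard library rather than XGBoost itself.

```python
import math


def imbalance_ratio(labels):
    """Return n_negative / n_positive for binary labels (1 = hazard event)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive instances")
    return n_neg / n_pos


def weighted_log_loss(labels, probs, pos_weight):
    """Binary cross-entropy where each positive instance counts pos_weight times."""
    eps = 1e-12
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Illustrative labels: 1 hazard event among 10 instances.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ratio = imbalance_ratio(labels)

# With pos_weight equal to the ratio, both classes carry equal total weight.
n_pos = sum(labels)
n_neg = len(labels) - n_pos
pos_share = ratio * n_pos / (ratio * n_pos + n_neg)

probs = [0.5] * len(labels)
loss = weighted_log_loss(labels, probs, ratio)
print(ratio, pos_share, loss)
```

With this weighting a classifier can no longer reach a low loss by predicting the majority class everywhere, because the single hazard event contributes as much to the cost as all non-events combined.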